Performance Effectiveness

Toward the Development of Realistic Measures
of Performance Effectiveness

The topic about which I would like to comment today concerns
methods for evaluating the effectiveness of individual performance.
The individual of whom I speak might, for example, be an electronics
technician on a radar system, a medical student, a second grader, or
a pianist. The context in which the individual functions, while
important, and most certainly a variable in the equation which describes
performance effectiveness, is not the item of interest here. Instead,
I would like to discuss some generalized topics concerned with the
assessment of individual performance. Notice that I speak of
performance, not of attitudes, knowledge, abilities, or other so-called
intervening variables. I believe that we should be concerned with
performance outcomes: with actions or statements which we can observe,
define, and measure. Although this insistence may be viewed as
severely restricting the applicability of certain measurement models,
so be it. I do not wish to speculate about the effects of personality,
attitudes, and the like on performance. While these are interesting
and possibly productive areas to pursue, they are not the topic of
concern here.

Before proceeding, let me introduce some terms. These terms are
descriptive ones, which define various models of performance (and
other) testing. I would like to review them briefly.

The most widely used model for assessing individual achievement
is generally referred to as Norm-Referenced Measurement (NRM). In
NRM, the performance of an individual is typically considered
relative to the performance of other comparable individuals. This
model has benefited from many years of psychometric research. It is
useful for making decisions among individuals on the basis of
attainment, and for comparing individuals to normative distributions.
It allows for the possibility of ranking persons according to
competence on specific tasks, or on more general measures of
achievement. In cases where relative decisions must be made, such as
selection, promotion, pay level judgments, class rankings, and other
discriminations among individuals, NRM is the model of choice.

For example, if we have a test where local norms have been computed
over a period of time, and we discover that an individual's test score
is at the ninetieth percentile of that distribution, we may conclude
that the person of interest is doing better than about ninety percent
of the individuals in the population. A key emphasis of norm-referenced
measurement is to maximize individual differences so that one can
spread the distribution of test scores. Norm-referenced items are thus
designed to discriminate, and are often chosen to be of moderate
or extreme difficulty.
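
The percentile interpretation can be made concrete with a minimal
Python sketch. The function name percentile_rank and the list of local
norms are hypothetical illustrations, not data from any actual test.
Percentile rank is taken here as the percentage of norm-group scores
falling strictly below the score of interest, which is only one of
several common conventions.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores that fall strictly below `score`."""
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Hypothetical local norms: 20 scores accumulated over a period of time.
local_norms = [41, 45, 48, 50, 52, 53, 55, 57, 58, 60,
               61, 63, 64, 66, 68, 70, 72, 75, 79, 84]

# A score of 77 exceeds 18 of the 20 norm scores.
print(percentile_rank(77, local_norms))
```

Note that the result says nothing about whether a score of 77 is good
enough for any particular task; it only locates the individual within
the norm group.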

Unfortunately, however necessary NRM may be in performance
evaluation systems, it is not sufficient. Many educational institutions
are finding themselves in the position where minimal required levels
of competence are not being met. It is widely alleged, for example,
that many high school graduates cannot read acceptably. To the
extent that this and similar allegations are correct, they are
difficult to detect with a norm-referenced model. The reason is that
absolute performance standards are not specified in NRM. No external
criterion exists against which to assess individual performance. A
different measurement model, termed Criterion-Referenced Measurement
(CRM), is appropriate. A criterion-referenced test measures what an
individual can do, or knows, compared to what he must be able to do,
or must know, in order to successfully complete a task. Basically,
this means that an individual's performance is compared, or referenced,
to some external criterion, or performance standard. Such standards
are derived directly from an analysis of what is required to perform
a particular task successfully. In CRM, performance is interpreted
against an absolute standard without regard to the distribution of
scores attained by other individuals.
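
A short Python sketch may help show why a norm-referenced view can
hide a failure to meet minimal standards. All names, scores, and the
required standard of 70 are hypothetical; the sketch simply assumes
that a task analysis has produced a fixed cutoff.

```python
def meets_criterion(score, criterion):
    """Criterion-referenced decision: compare the score with a fixed standard."""
    return score >= criterion


def rank_in_group(score, group_scores):
    """Norm-referenced view: rank 1 is the highest score in the group."""
    return 1 + sum(1 for s in group_scores if s > score)


# Hypothetical reading scores for five graduates.
scores = {"A": 62, "B": 58, "C": 51, "D": 47, "E": 40}
REQUIRED = 70  # hypothetical standard derived from a task analysis

for name, score in scores.items():
    rank = rank_in_group(score, list(scores.values()))
    print(name, "rank:", rank, "meets standard:", meets_criterion(score, REQUIRED))
```

In this invented group the ranks spread neatly from first to fifth,
yet no one meets the standard. Only the comparison with the external
criterion reveals that.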

The distinction between NRM and CRM has been aptly illustrated
by Popham and Husek (1969) using the analogy of a dog owner who wants
to keep his dog in the back yard. The owner finds out how high the
dog can jump (a criterion-referenced test) and builds a fence high
enough to keep the dog in the back yard. How high the dog can jump
compared to other dogs (a norm-referenced test) is irrelevant.
Beginning with Glaser (1963), a number of researchers have made similar
distinctions. Glaser and Nitko (1971, p. 653), for example, have
described a criterion-referenced test as "one that is deliberately
constructed to yield measurements that are directly interpretable
in terms of specified performance standards." This definition has
been slightly expanded by Livingston (1972, p. 13):
"Criterion-referenced (is) used to refer to any test for which a
criterion score is specified without reference to the distribution of
scores of a group of examinees." Common to all definitions is the notion
that a well-defined content area and the development of procedures
for generating appropriate samples of test items are important.

Two other models of performance specification will also be
mentioned. Domain-Referenced Measurement (DRM) has been defined
by Sanders and Murray (1976) as "a test in which performance on a
task is interpreted by referencing a well-defined set of tasks (a
domain)." Domain-referenced tests are thus tests which emphasize
the creation of item pools or item forms representative of a
universe of all test items for a well-defined content area.

Another model, Objectives-Referenced Measurement (ORM), is
generally considered as measurement in which performance is interpreted
by referencing the behavioral objective(s) for which the item was
written. Objectives-referenced tests emphasize test items which are
derived directly from predetermined behaviors. ORTs are thus tests
whose items are operational definitions of behavioral objectives.
(See Sanders and Murray, 1976, for a further discussion of these
topics.)

It thus appears that domain- and objectives-referenced
measurement refer generally to the content which the test was developed
to assess. Norm- and criterion-referenced measurement, on the other
hand, refer generally to the way in which a test score is interpreted,
regardless of content.

Many sophisticated models for the development and validation
of achievement measures exist. The problem with many of these models
in everyday situations is that their esoteric nature and complicated
procedures often serve to minimize their utility. Classroom teachers,
it is alleged, rarely consider questions of reliability and validity
in their test development activities. One reason for this may be that
the establishment of test reliability and validity generally involves
complicated procedures, as well as a great deal of work. It is often
neither cost-effective nor time-effective for a public school teacher
to compute item statistics or test reliability and validity
coefficients. A typical approach to test development in applied
educational contexts is simply: (1) to determine the domain which one
wants to test, (2) to write a number of test items relevant to that
domain, (3) to administer the test to the appropriate student
population, (4) to score the test as objectively and unbiasedly as
possible, and (5) to arbitrarily establish cutting points for the
grade distribution. It is here suggested that this may be a reasonable
approach if one's purpose in developing the test is a norm- and/or
domain-referenced one. If, however, one is concerned about objectives
and criterion-referenced measurement (and I firmly believe that we
must be concerned about these aspects), the approach is generally
inappropriate.

It is not my purpose today to go into all the diverse components
of individual performance measurement systems, but to describe
briefly certain aspects of objectives-oriented, criterion-referenced
systems which I believe to be of general interest. The areas about
which I would like to comment are often considered to be troublesome
ones. They have generated a great deal of discussion and comment, yet
so far as I know, there exists today no general agreement concerning
their solution.

Objectives. First, let us consider behavioral objectives. It
is my belief that adequate behavioral objectives can, generally
speaking, be divided into three components. These are: performances,
conditions and standards.

Performances. Every objective should state precisely what the
individual must do. The statement of performance must be clear enough
for that performance to be trained and tested. Examples of adequate
performance statements are: climb the telephone pole; state the
conditions under which a tourniquet should be applied; add two 5-digit
numbers, etc. Every statement of performance should include an
action verb. This verb is the key to the performance. It tells what
must be done. In the example "state the conditions under which a
tourniquet should be applied," the action verb is "state". You can
actually test a student's ability to state the required conditions.
Suppose that the statement of performance had read, "appreciate the
situations in which a tourniquet should be applied." Would you know
what to test? How would you know when a student appreciates situations?

Conditions. Every objective should also include a statement of
the conditions under which the performance must be demonstrated. Such
statements should indicate: (a) what the student has to work with (or
what he is allowed to use), (b) the circumstances under which the
performance must be demonstrated, (c) what the student must work on (his
starting points), and (d) limitations or special instructions. It is
extremely important for an objective to specify all conditions which
may affect performance. Without statements of the conditions, one
cannot be sure of what to teach or test. Suppose, for example, that
an objective stated, "compute the square root of the number 125."
You, the student, have received training in the computation of square
roots, and are ready to be tested. An unknown examiner takes you to
a room, closes the door, and asks you to compute the appropriate square
root. Your response: "But, during my training in square root
computation, I had access to a calculator." The examiner's answer: "It
is important to be able to compute square roots under any circumstance;
you won't always have a calculator."

The point of this rather simple example is that, if conditions
aren't specified, the student won't know exactly what he needs to learn
to do, and the test developer won't know just what it is he should test.
A precise specification of the conditions under which the performance
must be demonstrated is critical.

Standards. Thirdly, each objective should specify precisely
the standard or criterion against which performance is to be
evaluated. As is the case for statements of performance and conditions,
standards too must be clearly stated in the objective. For an
example, suppose that an objective stated, "Be able to type accurately
using an electric typewriter under standard office conditions."
Lacking standards for speed and accuracy, how fast would you train
people to type in order to satisfy the objective? How fast would they
have to type to be able to pass your criterion-referenced test?
Obviously, the statement is lacking a clear statement of standards.
"Accurately" doesn't really tell you anything. A complete objective
might read: "Using an electric typewriter under standard office
conditions, be able to type 50 words per minute, corrected for accuracy
(that is, one word subtracted for each mistake)." Working from such
an objective, you would know what standards to shoot for in training,
and the level of performance the examinee must demonstrate on the test.
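
The typing standard can be turned into a scoring rule, sketched here
in Python under stated assumptions. The function names and the sample
figures are hypothetical. The correction is assumed to be applied to
the total word count before dividing by the elapsed time; the objective
as quoted does not say whether the subtraction is made per test or per
minute.

```python
def corrected_words_per_minute(words_typed, mistakes, minutes):
    """Words typed minus one word per mistake, divided by elapsed minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (words_typed - mistakes) / minutes


def meets_typing_standard(words_typed, mistakes, minutes, standard_wpm=50):
    """True when the corrected rate reaches the standard stated in the objective."""
    return corrected_words_per_minute(words_typed, mistakes, minutes) >= standard_wpm


# Hypothetical five-minute samples with the same gross output.
print(meets_typing_standard(262, 9, 5))   # (262 - 9) / 5 = 50.6 corrected words per minute
print(meets_typing_standard(262, 15, 5))  # (262 - 15) / 5 = 49.4 corrected words per minute
```

The two examinees type the same number of words, but only the first
satisfies the objective, which is the kind of unambiguous decision a
complete statement of standards makes possible.
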

A final comment on objectives is that they must be unitary.
They should cover one task or task aspect only. To check that
objectives are unitary, one should examine the parts that describe the
performance. Looking at the performance required by a given
objective, one might ask oneself the following two questions: (a) Does
the objective call for performance on just one task? (b) Are all
tasks independent (that is, success on one objective does not require
successful performance on the preceding one)? If the answer to either
question is a definite "no", the objectives are probably not
unitary and need to be broken down into unitary ones.

Item Format and Level of Fidelity. The second topic which I
wish to comment upon concerns item format and level of fidelity.
Before constructing test items, the developer is typically faced with
questions of item format. Do we want paper and pencil items,
"hands-on" performance items, multiple choice items, recall measures,
job simulation, supervisor ratings, or what? Virtually any of these
formats can be adapted to a testing situation. There may be others
that are even more appropriate. How to choose? These are questions
involving item format and test fidelity.

The term fidelity addresses the extent to which a test resembles
the actual objective or performance being examined. The more the test
resembles the performance in question, the higher the fidelity of the
test. Here is one place where practical testing constraints have a direct
impact on test development. If, for example, it is too costly to
use an actual aircraft for maintenance tests, and one must therefore
use a simulator, one loses fidelity unless the simulator is very much
like the actual aircraft in terms of required performance. To the
extent that the performances required on the simulator approach those
required on the actual equipment, the fidelity loss is minimized.

Frederiksen (1962) has proposed a multiple level classification
of fidelity in performance testing. The first category (and lowest
fidelity level) is to solicit opinions. This category may in fact often
miss the payoff question (e.g., to what extent has the behavior of
trainees been modified as a function of the instructional process).
The second category is to administer attitude scales. This technique,
although psychometrically refined via the work of Thurstone, Likert,
Guttman and others, assesses primarily a psychological concept
(attitude) which is presumed to be concomitant with performance. Third
is to measure knowledge. This is, without doubt, the most commonly used
method of assessing achievement. This technique is usually considered
adequate, however, only if the training objective is to produce
knowledge. Fourth: elicit related behavior. This approach is often used
in situations where, due to practical considerations, one must resort
to observation of behavior which is thought to be logically related to
the criterion behavior. Fifth: elicit "What I Would Do" behavior. This
technique usually involves the presentation of brief descriptions of
problem situations or scenarios, under simulated predesigned
conditions, and requires a subject to indicate what he would do to solve
the problems if he were in the situation. And finally, at the highest
fidelity level: elicit lifelike behavior. This category includes
behavioral assessment under conditions which approach the realism of
the life situation. Flight simulators, for example, fall into this
category.

A good guideline for item format is that the item should be in
the form that best approximates the behavior specified by the
objective. If the instruction is aimed at problem solving, for instance,
then the items should address problem solving tasks and not, for
example, knowledge about required background content. If the
instruction is intended to evaluate a particular performance, the items
should be about evaluating that performance, not actually performing
the tasks. It is also important that item styles not be widely mixed
in a test, so as to avoid measuring test taking skill instead of
subject-matter competence.

Objectivity of Measurement. Third, I would like to mention
objectivity of measurement. Each of Frederiksen's categories described
above appears to possess both advantages and disadvantages. Optimally,
one would hope to assess individual performance at the highest possible
level of fidelity. Unfortunately, this may imply a subjective (rating)
technique for a specific situation, which then requires a subjectivity
vs. fidelity tradeoff. In order to minimize subjectivity in a real
life situation, it may be necessary to decrease the level of fidelity
so that more objective measurements (such as time and errors) can be
obtained. Such a fidelity decrease can, in certain instances, be
theoretically justified. Presumably, an actual increase in overall
criterion adequacy may result from a gain in objectivity which
compensates for a corresponding loss in fidelity.

In low fidelity performance testing situations, such as those
using paper and pencil multiple-choice formats, objectivity in scoring
is apparent: such tests can, for example, be computer scored. In higher
fidelity testing situations, it is relatively simple to maximize
objectivity in so-called "hard-skill" areas such as electronic
maintenance. In "soft-skill" areas, such as creativity, leadership, etc.,
objectivity in scoring is considerably more difficult to achieve.
To the extent that objectivity is not achieved, reliability is
attenuated.

One suggested method of maximizing objectivity in "soft-skill"
testing is to require several examiners to assess each individual.
Inter-rater agreement can then be calculated. If low inter-rater
agreement is found consistently, the test should be revised.
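
A minimal Python sketch of one way such agreement might be calculated
for two examiners follows. The text does not name a statistic; simple
percent agreement and Cohen's kappa (which discounts agreement expected
by chance) are shown here as common choices, not as the method the
author had in mind. The function names and ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of examinees on whom the two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail ratings of ten examinees on a leadership exercise.
examiner_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
examiner_2 = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

print(percent_agreement(examiner_1, examiner_2))
print(cohens_kappa(examiner_1, examiner_2))
```

With more than two examiners, the same calculation could be repeated
for each pair; what counts as "low" agreement is a judgment the test
developer must make in advance.
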

Scoring Problems. Fourth, allow me to mention scoring problems
in the development of performance tests. The difficulties
associated with scoring performance tests have been described by so many
for so long that by now virtually everyone with an interest in this
area knows that problems often include expense, long administration
times, apparatus which may break down at inconvenient times, narrow
applicability, unreliability, etc. Yet we must develop performance
evaluation systems which minimize these difficulties while providing
valid measures of performance. Two scoring questions continually
arise in performance testing. These concern product vs. process
scoring, and the question of assistance vs. non-interference.

Products versus processes. Should "products" or "processes" be
scored? Should the extent to which a "right answer" is obtained be
measured, or should the extent to which the proper procedure was used
be measured, regardless of the final result; or some combination of
these? One way to score within-stage troubleshooting, for example,
is to determine whether the subject is or is not able to identify the
defective component. This method scores only the product of
troubleshooting. If such a scoring scheme is used, it is difficult, if
not impossible, to determine which of the many possible causes resulted
in failure to solve the problem. The subject may have made errors in the
use of technical data; he may have made errors in the use of test
equipment; or he may have made logical errors in deciding where to make
the check. Observation of the performance process may enable
identification of the causes for failure.

Another area of concern in scoring products alone is that there
may be only a single task in the task category. If only the product
of that task is observed, only a single measure is obtained on each
subject for that task category.

Finally, for some tasks there is no product at the end of the
process. Checkout procedures, for example, may include energizing the
equipment to be checked, making all the required checks, and
de-energizing the equipment. If performance of the process is not
measured, it is impossible to determine whether the procedure has been
done correctly, and this is the primary item of interest.

Three conditions under which processes should be scored in
addition to, or instead of, products are: when diagnostic information
is required, when additional scores are needed for a particular task,
and when there is no product at the end of the process. For an
excellent discussion of process versus product scoring, see Osborn (1973).

Assistance versus no assistance. Should an "assist" or
"non-interference" method of scoring be used? If the non-interference
method of scoring is used, serious distortions of scores may result
when inexperienced students are tested. In some cases it may even
be impossible to find out how much of the task [...]
because many of the subtasks require proper [...]
steps. If the tester does not in some way assist the task performer
in step 1, it may, in effect, be impossible to administer the test,
even though the examinee may be able to perform all of the remaining
steps. The intervention of the administrator does indeed introduce
distortion into the meaning of the test score. A slightly distorted
score, however, is better than no score at all. If assists can be
kept to a minimum, the distortion is likely to be relatively minor.
Properly controlled, an assist approach can indeed be used effectively.
The nature of many activities is such that an assist method may be
mandatory.

Reliability and Validity. Finally, allow me to discuss the areas
of reliability and validity in criterion-referenced measurement.
Persons who have completed an introductory course in psychometrics
understand that the validity of a test cannot exceed the square root of
its reliability. But to what extent are these traditional concepts
applicable to criterion-referenced testing?

Reliability. Stanley (1971) has described techniques for
applying traditional reliability concepts, as developed in norm-referenced
contexts, to criterion-referenced tests. Since criterion-referenced
measures are designed for situations where discriminations among persons
are of minimal importance, traditional concepts of test reliability are
less applicable. This is the case since criterion-referenced measures
are often used in situations having little or no variation among true
scores. However, since the basic concept involved is to discriminate
individual variation from a fixed criterion score, a
criterion-referenced test can give reliable scores even though the
classically defined parallel forms reliability coefficient is low.

A recent work by Livingston (1972) has shown how classical
concepts of reliability can be applied to criterion-referenced measures.
Basically, the procedure involves a redefinition of variance,
covariance and correlation in terms of deviation from a criterion, rather
than from the mean. Livingston has also shown how other classical
norm-referenced reliability concepts, e.g., correction for attenuation
and the Spearman-Brown formula, apply to criterion-referenced
measurement.
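
As a hedged illustration of this redefinition, the criterion-referenced
reliability coefficient usually attributed to Livingston (1972) can be
written as follows. The notation is an assumption of this sketch rather
than a quotation: X is the observed score, C the criterion score,
\mu_X and \sigma_X^2 the mean and variance of the observed scores, and
r_{XX'} the classical (norm-referenced) reliability coefficient.

```latex
% Variance redefined as mean squared deviation from the criterion C:
\[
  E\big[(X - C)^2\big] \;=\; \sigma_X^2 + (\mu_X - C)^2
\]
% Resulting criterion-referenced reliability coefficient:
\[
  k^2 \;=\; \frac{r_{XX'}\,\sigma_X^2 + (\mu_X - C)^2}
                 {\sigma_X^2 + (\mu_X - C)^2}
\]
```

When the group mean equals the criterion, the coefficient reduces to
the classical reliability r_{XX'}. As the mean moves away from the
criterion, the coefficient can only rise, which is consistent with the
remark that a criterion-referenced test may yield reliable scores even
when its classical reliability coefficient is low.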

Such techniques are, for the most part, however, not fully
developed. (For example, see Oakland, 1972; Haladyna, 1974; and Woodson,
1974.) The need for additional work in the area of criterion-referenced
reliability continues to be a pressing one.

A practical solution is to assess test-retest reliability of
criterion-referenced tests, a procedure which does not depend on internal
consistency, and which increases the variability of the test results
because of the two test administrations required. The phi (φ) coefficient
is useful for analyzing the resulting four-fold (first administration-
second administration, vs. pass-fail) data. It has elsewhere been
suggested (Swezey and Pearlstein, 1975) that φ values of less than
+.50 tend to indicate unacceptable test-retest reliability for
criterion-referenced tests.
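
The calculation can be sketched in Python as follows. The function
name, argument order, and cell counts are hypothetical; the formula is
the standard phi coefficient for a 2 x 2 table, and the .50 cutoff is
the rule of thumb cited above rather than a fixed requirement.

```python
import math


def phi_coefficient(pass_pass, pass_fail, fail_pass, fail_fail):
    """Phi for a four-fold table of first vs. second administration.

    pass_fail counts examinees who passed the first administration and
    failed the second; fail_pass counts the reverse.
    """
    a, b, c, d = pass_pass, pass_fail, fail_pass, fail_fail
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


# Hypothetical results for 40 examinees tested twice.
phi = phi_coefficient(pass_pass=22, pass_fail=3, fail_pass=4, fail_fail=11)
print(round(phi, 2))
print("acceptable" if phi >= 0.50 else "unacceptable")
```

Phi is undefined when every examinee passes (or every examinee fails)
one of the administrations, a situation that is not unusual when
training has been effective and that must be handled separately.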

Content validity. The process of determining performance
criteria on the basis of information obtained directly from job-required
skills defines a content-valid criterion. Criterion tests which are
derived from appropriate training analyses provide the best available
measure of behavioral objectives. No better criterion exists upon
which to validate these instruments.

Cronbach (1971) has treated the case of criterion-referenced
content validity in his discussion of performance testing. Content
validity is a matter of the extent to which a test corresponds to the
population performance objectives. Content validation can be viewed
as absolute measurement; thus the score on a test suggests that an
individual does or does not possess the abilities to adequately
perform the task. Cronbach uses the example of a dictated spelling test
which, he says, is "a measure of hearing, and spelling vocabulary and
ability to write" (1971, p. 453).

Content validity is also temporary. Content validities reflect
behaviors, tasks, etc. which occur in the world today. These change
with the passage of time. It is necessary, therefore, in developing
objectives-oriented, criterion-referenced tests, that procedures be
developed which insure that a prospective user who follows the
specified procedure today will arrive at a test reasonably like the job
today. The entire process may change tomorrow.

Content validation, it is argued, is an especially appropriate
method in criterion-referenced applications. A test is content valid
if the test items are carefully based on the performances, conditions,
and standards specified in the objectives, and if the test items
appropriately sample objectives. (Of course, the objectives
themselves must be sound.) Thus, in most instances, careful test
construction will, itself, enable the development of content valid tests.
However, in instances where low fidelity tests are constructed, it
may be more difficult to determine content validity, since the items
are not likely to be precisely matched to objectives. In such cases,
there are two additional types of criterion-related validation that are
well-suited to criterion-referenced measurement: concurrent validity
and predictive validity.

Concurrent validity. In determining concurrent validity, test
results are compared with an outside measure of the behaviors tested.
This outside measure must be the best available assessment of
performance on the objective(s) in question. The assessment of concurrent
validity involves individual assessment via the test and the outside
measure close together in time (concurrently). φ again may be used on
the four-fold data (CRT-other measure, vs. pass-fail).

Predictive validity. Performance prediction using
criterion-referenced measures is no less practical or more difficult than
is prediction using standard, norm-referenced measurement techniques.
Although criterion-referenced scores are often of the "go, no-go"
variety, they can be employed as predictors of continuously measured
criteria via point biserial and biserial techniques, and of dichotomous
standards via phi coefficients and tetrachoric coefficients. (See
McNemar, 1962, for a discussion of these techniques.) Predictive validity
is a particularly appropriate concept in the case of
criterion-referenced measurement.

Predictive validity involves the same assumptions as does concurrent
validity. The outside measure must be an accurate measure of the
performance in question, or the validation will be meaningless. Predictive
validity can be calculated the same way, except the outside measure is
taken at a later time, i.e., when the individuals are actually performing
the activity for which they've been trained.
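
For the case of a "go, no-go" test score used to predict a continuously
measured criterion, a minimal Python sketch of the point biserial
correlation follows. The function name and the data (pass/fail results
at the end of training, and later supervisor ratings on the job) are
hypothetical. The formula used is the standard one based on the
population standard deviation of the criterion.

```python
import math


def point_biserial(passed, criterion):
    """Point biserial correlation between pass/fail and a continuous criterion."""
    if len(passed) != len(criterion):
        raise ValueError("inputs must be of equal length")
    n = len(criterion)
    go = [y for p, y in zip(passed, criterion) if p]
    no_go = [y for p, y in zip(passed, criterion) if not p]
    if not go or not no_go:
        raise ValueError("both pass and fail groups must be non-empty")
    mean_all = sum(criterion) / n
    # Population standard deviation of the criterion over all examinees.
    sd = math.sqrt(sum((y - mean_all) ** 2 for y in criterion) / n)
    if sd == 0:
        raise ValueError("criterion scores have no variance")
    p = len(go) / n
    mean_diff = sum(go) / len(go) - sum(no_go) / len(no_go)
    return mean_diff / sd * math.sqrt(p * (1 - p))


# Hypothetical data: test result after training, job rating some months later.
passed_test = [1, 1, 1, 1, 1, 0, 0, 0]
job_rating = [8, 7, 9, 6, 8, 5, 4, 6]

print(round(point_biserial(passed_test, job_rating), 2))
```

The same function serves for concurrent validity when the outside
measure is collected at about the same time as the test.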

Summary. This paper has attempted to present and discuss some
cogent issues in the development of objectives-oriented,
criterion-referenced measurement systems. The problems in these areas
have not been solved by a long way. Much work remains. Nevertheless, it
is suggested that domain-oriented and norm-referenced systems, while
appropriate in many situations, are inappropriate or insufficient in
others. Development of objectives-oriented, criterion-referenced tests
must, of necessity, proceed. Guidance in how to construct such tests is
continually being developed and distributed. This guidance is based upon
the best available experience and the existing state-of-the-art. Yet
many fundamental questions remain.